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Abstract 

In this paper we extend existing Bayesian methods for variable selection in Gaussian process 
regression, to select both the regression terms and the active covariates in the spatial correla¬ 
tion structure. We then use the estimated posterior probabilities to choose between relatively 
few modes through cross-validation, and consequently improve prediction. 


1. Introduction 

Gaussian processes have been employed extensively both in regression and classification 
problems in machine learning (see Rasmussen and Williams 2006). Their earliest use in sta¬ 
tistical modeling may have been for fitting Kriging predictors in geostatistics (see e.g. Krige 
1951, Cressie 1993) and they have been the most popular metamodeling approach to com¬ 
puter simulations (see Santner et al. 2003, chapter 3). 

The problem of variable selection in Gaussian process regression has been addressed by Lin- 
kletter et al. (2006), while Savitsky et al. (2011) extended their results to generalized linear 
nonparametric models. In both cases no regression terms were included, and dependence on 
the factors was modeled by the authors solely via the spatial correlation structure. 

In many applications only interpolation or smoothing are required, and the omission of re¬ 
gression terms can simplify the fitting of the Kriging metamodel, with little expense in pre¬ 
dictive precision (see e.g. Sacks et al. 1989). It is also known that these models have a strong 
“Kriging towards the mean” tendency at sites far away from the training data, and it is thus 
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often essential to add a trend for effective extrapolation (see Journel and Rossi 1989 for the 
original argument). Currently, the common practice is to either include all of the covariates 
both in the correlation structure and the linear component (i.e. Universal Kriging) or to omit 
the linear part altogether and model the data as observations from a pure (weakly) stationary 
random process (i.e. Ordinary Kriging). Joseph et al. (2008) proposed a Bayesian procedure 
for designed computer experiments, to select first and second order polynomials for the mean 
surface while leaving the covariance terms untouched. In this work we allow for exclusion 
of covariates from the correlation structure, along with their possible inclusion in the trend 
component, in the case that their only contribution to the output is linear. 

2. Model Assumptions 

Let {xi,... ,x n } C X be our training set, where X C W is the experimental region, and let 
y = [y (a?i) ,,y (a? n )] T be a vector of observations from the model 

y (*) = A) + x J (3 + z (x) + £ (x) , (l) 

where /3 0 is a constant, / 3 = \(3i,..., /3 P ] J is a vector of regression coefficients, Z (x) is a 
zero mean stationary Gaussian process with marginal variance a 2 z and a separable correlation 
function of the form 

R(u,v;p) = f[pp- Vi) * (2) 

3 = 1 

for some vector of correlation parameters p = [pi,..., p p ] J , 0 < pj < 1 , 1 < j < p, and 
£ (cc) is a white noise process, independent of Z (x), with variance parameter a 2 e . Note that 
the closer pj is to 1, the less Z ( x ) tends to vary in the yth direction, and vice versa. This 
point is illustrated in Figure 1, in which a single realization of Z ( x ) is drawn, with p — 2, a 
relatively large value of p\ and an extremely small value of p 2 - 

Denoting A = a 2 fa 2 , it can be easily verified that 

y\Po:P,a 2 z ,p,\~J\f {/Wn + Xf3, a 2 z [R (, p ) + XI}} , (3) 

where X is the design matrix and 

P 2 

Rim (P) = R(X U X m ) = [] P { P. 

3 = 1 
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Figure 1: A realization of Z (x) with the correlation function (2), for p = 2,p\ = 0.75 and p 2 = 5 x 10 6 . 

We can interpret (1) as follows: each of the p covariates in x may (or may not) make a 
linear contribution to the response y ( x ). If, in addition, it makes a nonlinear contribution, 
it should be captured by Z ( x ), whose flexible nature makes it free of any rigid structure. 
The white noise term, also known as the “nugget” (see Andrianakis and Challenor 2012) 
and interpreted as “measurement error”, is often omitted when using the Kriging metamodel 
to emulate deterministic computer simulations. However, its (possibly unnatural) inclusion 
has certain benefits, both numerical and in terms of the average squared prediction error. 
The consequent A term in (3) is mostly identified with the equivalent penalized least squares 
minimization problem in the Reproducing Kernel Hilbert Space corresponding to R(-, ■) (see 
Wahba 1990). In this paper, we would l ik e to examine the possibility of leaving some of the 
covariates out of the correlation function. By doing that, we deliberately omit variables that 
are (in the case of computer experiments) known for a fact to be part of the computer code. 
It is then reasonable to add an error term, to avoid interpolation at the learning set, in favor 
of better predictions at new sites. 

3. Formalizing as a Bayesian Model 

We would now like to turn our attention to the problem of variable selection in model (1). 
Research on Bayesian variable selection in linear regression models has been thoroughly 
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documented - Chipman et al. (2001) give a detailed account of some theoretical and compu¬ 
tational aspects Linkletter et al. (2006) and Savitsky et al. (2011) apply point-mass mixture 
prior distributions for the correlation parameters of simple Kriging models. In this work, 
we aim to propose a unified framework, in which “spike-and-slab” priors, like those used by 
Linkletter et al. (2006) and Savitsky et al. (2011), are employed to select the components of 
both the regression and spatial parts of the model. 

3.1. Writing the Posterior Distribution 

To put the above to practice, we now assume that the regression coefficients are independent, 
with prior distributions 

/3j |uy ~ u r J\f (0, t 2 ) + (1 — co r ) <5o , (4) 

where d], is a point mass at zero. Alternatively, one can denote by 7 r = (y[,... , 7 a 
vector of latent variables, in which 7J = 1 {/ 3 j ^ 0} indicates whether or not the yth co¬ 
variate is included in the linear component. We may then write 7 J \u r ~ Binom (l,uy) and 

Pj\lj ,l r jT 2 ) . 


Regarding the spatial component, for the jth variable to be inert, its corresponding correlation 
factor pj must be equal to 1. An analogue of (4) would then be 

Pj\ uj c ~ Lo c U (0,1) + (1 - U! c ) 5 1 , (5) 

implemented by defining the vector of latent variables j c = (y(,..., 7 “), where 7 ? \ui c ~ 
Binom (1, cu c ) and 

71 (Plhj) = Y W(0,1) + (1 - 7 j) Si (pj) ■ 

Using Bayes rule, we may now write the posterior distribution 

P {Y,l C ,Ur,Uc,(3o,P,P, cr 2 z ,\\y) 


ocp (y\/3 0 ,{3,a 2 z ,p, A) vr (Y,P\ur) vr (7 c ,p|u; c ) x vr (/ 3 0 ,cr 2 z , \,u r ,u c ) , (6) 

where p (y |/? 0 , /3, <x 2 , p, A) is the multivariate Normal density, as indicated in (3), 

7 T (7 r , P\u r ) = LUp 3 ^ (1 - UO r )^i 1 { 7 I=°} (j) T (^.) 


7, r =l 


(here p T (/ 3j ) is the Normal pdf with zero mean and standard deviation r, evaluated at pj) and 

/ c I \ Ej 1 {7 J c =l}/ 1 w i{ 7 c =o| 

7 T (7 ,p\U c ) = COc J (1 — Ulc ) 2 ^ 3 i . 
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3.2. Markov Chain Monte Carlo Sampling from the Posterior Distribution 
We now apply a Metropolis Sampling algorithm to draw a random sample from the posterior 
distribution. A suitable ‘jump’ (or ‘proposal’) distribution is required - one that will allow 
the resulting Markov chain to arrive at any possible state with positive probability. Here we 
extend the key ideas of Savitsky et al. (2011), to propose a full Metropolis-Hastings schema: 


1. Update ( 7 r , /3 ) and (-y c , p) : draw k G {0,1,..., 2 p} from a Binom(2p, u) distribu¬ 
tion (typically v = (2p) ') and change the present state of k randomly chosen latent 
variables out of (-y r , 7 C ). 


- If the value of 7 J changes from 0 to 1, sample from a A f (0, r 2 ) proposal. 

- If the value of 7 ? changes from 0 to 1, sample pj from aW(0,1) proposal. 

- If the value of 7 J changes from 1 to 0, set (3j = 0. This results in covariate Xj 
being omitted from the linear trend in the current iteration. 

- If the value of 7 c - changes from 1 to 0, set pj = 1. This results in covariate x 3 
being omitted from the correlation function in the current iteration. 

* In any of the above cases, add a small jitter to the remaining (3 and p values (while 
making sure all the pj -s remain in the interval ( 0 , 1 )). 


2. Update (/3 0 , o' 2 , A, uy, lu c ) : to simplify matters, transform the parameters to 

p = log (a 2 ) ; C = log(A) ; ip r = log 

1 — co r 

and 

A = log Uc , 

1 - W c 

and sample (/3 0 , p, (, U r , C c ) from a multivariate Normal distribution about the present 
state, (/3 0 , Pi Ci U r , up). In our implementation we used independent components. Re¬ 
member to take the Jacobian term 


<9(cr 2 , A ,uj r ,uj c ) 
d(piCifrAc) 


P (j uj t (1 co r )oj c (^\ 


into account in the posterior later. 
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3. The proposed vector ^ 7 r , /3, 7 C , p, /3 0 , (A, A, dy, is accepted with probability 

( P (Y,Y,Po,P,pY 2 z , \,uj r ,uj c \y) 

a = min < 1, —-i—— 

I P{7 r ,YYo,P,P,YY,^r^c\y) 

Note that (7) holds, since by construction, the proposal distribution described above is 
symmetric, i.e. 

P (V, A Y, p\Y, P, Y, p) =P (j r , P, Y, p|7 r , P, Y, p) 

and 

p (/3o,cr^,A|/3o,cr^, a) =p^o,^,A,a; r ,a;c|^o,cr2,A,a; r ,a; c ) . 



4. Prediction Strategies 

We now wish to take advantage of the procedure described in Section 3.2, to make accurate 
predictions at m new sites X new . We will propose two possible approaches, the first of which 
- based on model averaging - is a more “orthodox” Bayesian approach, while the second one, 
in which we merely use the schema of Section 3.2 to identify the MAP model, might perhaps 
be characterized as opportunistic. 


4.1. Prediction based on Model Averaging 

Conditional on © = (/? 0 , (3, p, o 2 , A), the joint distribution of y Q ld = y (X old ) and y new = 
y (X new ) is given by 

X/riew 

Void 



where 


■^new H“ m 

-^new,old 

dT 

■^new,old 

Rold T A I n 


and -R ne w,old is the cross-correlation matrix 


TfiJ 

■^neWjOld 




t i 

'new’ 1 


C oldJ 
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It is therefore straightforward to show that 


E 


Vn 


2/old, © r — A)1 m H“ X new (3 


R 


-^old “t” XI n (2/old A)f n -^old/3) 


new,old 


( 8 ) 


and predictions, based on a sample {©d^Li from the posterior distribution (as described in 
section 3.2), can be made by integrating © out, namely 


y 


new 


E 


■j^ 2/new 



/ 


E 


Vv 


Void, © \ p (©I 2/old) cl© 


N 


(9) 


i=1 

This is, in fact, the usual model averaging estimator (see e.g. Ghosh et al. 2006, Chapter 9). 
If the set X new is known in advance, (8) can be calculated on the fly (within each Metropolis 
iteration), and significant calculation time can be saved, depending on the acceptance rate of 
our sampling - provided that no new predictions are made whenever a new vector is rejected. 
It is also worthwhile to consider “denoising” through elimination of highly unlikely models 
from (9), by setting exceptionally small posterior probabilities (according to a predetermined 
threshold) to zero. 


4.2. Prediction based on Variable Selection 

Another obvious option is to regard the Bayesian mechanism of Section 3 as a mere selection 
procedure to identify influential (either linear or nonlinear) variables. In that case, inert vari¬ 
ables in either the regression or the correlation part, will be set to 0 or 1, respectively, and the 
prediction at location x will be 


y(x) = x J f3 + r I (as) 


Rp + XI 


-i 


y-x(3 


( 10 ) 


where y/3,p,\J are the maximum likelihood estimators of (j3,p,X), obtained by solving 
iteratively 


\ ( -r 

r ^ i 

- 1 V 1 

r ^ i 

H* T 

Rp + XI 


Rp + XI 


-l 


r M A P, A = n [y - X(3 


Rp + XI 


i -i 


y , 


y-x/3 , 


V 



argmm 

p,\ 


n log a l(3(p,X) ,p,X) + log \R P + XI 
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Details of the optimization routine are given in Roustant et al. (2012). 


Denoting M. = ( 7 '’, 7 C ), our model of choice could then be the highest posterior probability 
model, i.e. 

M* — argrnax p (M.\y) 

M 

(to which we will refer in the text as the maximum a posetriori model , or simply - the MAP 
model), or, alternatively, we may set a threshold q on the posterior inclusion probability 

Pi = p (Me\v) , 

&7»=1 

(where 7 ,; may be either or 7 ?), and select the variable Xi to either part, only if p, > q. 
The choice of q — \ leads to what is famously known as the ‘median model’, which - in 
the simpler case of linear regression with uncorrelated noise - has been shown by Barbieri 
and Berger (2004) to be the best predictive model, under very strict conditions on both the 
prior distributions and the covariates where predictions are to be made. In Section 5 we set 
the threshold on the posterior inclusion probability at a relatively high value of 0.8. A more 
educated choice of that value is proposed in Subsection 6.1, based on v-fold cross-validation. 

5. A Simulation Study 

We now wish to test the proposed methods on simulated data. For that purpose, we used 
a 35-run maximin Latin hypercube design (see Jones and Johnson 2009) in [— 0.75, 0.75 ] 5 to 
sample noisy data from the allegedly 5 dimensional function 

37 T-X" 2nX 71 X 

y ( x ) = 3X 2 + 4 X 3 + 5 X 4 + 5 cos —^ + 4 cos —^ + 3 cos —^ + e(x) , (11) 

where£(:r) J\f (0,0.1 2 ) . Note that while (11) consists of both linear and non-linear parts, 
the variables X 4 and X 4 are absent from the linear and non-linear parts, respectively, and X 5 
is inert. Figure 2 shows how the different (additive) components of function (11) vary as 
functions of their respective variables. 

We assigned prior distributions as follows: 

&|tJ = 1 ~-A/"(0,5 2 ) , j — 1,... ,5 ; uv,u; c ~w(0,l) ; /3 0 ~ U (0,10 2 ) ; 
a 2 z ~ Inv-T(3, 2) and A ~ Inv-T(3, 0.2) , 
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Figure 2: Plotting function (11) against each of X± - X 4 . 

and calculated the root mean square prediction error (RMSPE) on an independent 100-run 
maximin LHD in the same region. A size 250, 000 sample was drawn from the posterior 
distribution, as described in Section 3.2, after the first 5000 were discarded as burn-in. 


p Posterior Distributions 


p Posterior Distributions 



Figure 3: Box plots of the posterior distributions of the regression and correlation parameters, for the simulated 
data. 

Figure 3 shows the posterior distributions of both the regression coefficients and the correla¬ 
tion parameters, while Figure 4 shows their respective marginal posterior inclusion probabil¬ 
ities. For now, we set the inclusion limit at 0.8, leaving us with a single regression term (X 4 ) 
in the trend part of the model, and three covariates in the nonlinear component (X\, X 2 and 
X 4 ). This is also consistent with Figure 3. Apparently, the Gaussian process term effectively 
reflects the linear effects of X 2 and X 3 as well as their nonlinear effects, which explains why 
they were dropped from the regression component. As expected, Figure 3 clearly identifies 
X\ as inactive in the trend part and X 4 in the non-linear component, while X 5 is deemed inert 
beyond any doubt. 
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Posterior Inclusion Probabilities (Regression) 


Posterior Inclusion Probabilities (Correlation) 
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Figure 4: The posterior inclusion probabilities for the simulated data. The dashed line marks 80% posterior 
inclusion frequency. 
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Figure 5: Predictions vs. true values for the simulated example. 

Table 1 contains a comparison of the empirical RMSPE of several fitted models when tested 
on the 100-run validation set. The models include ordinary kriging (OK, no linear trend 
at all), universal kriging (UK, a linear trend which includes all of the covariates), model 
averaging based on (9), variable selection based on the posterior inclusion probability (0.8 at 
least) and the empirical highest posterior probability (MAP) model. 

Table 1: Comparison of several prediction methodologies for the simulated data. 


Method 

RMSPE 

OK 

0.168 

UK 

0.130 

Averaging 

0.247 

Posterior Inclusion 

0.115 

MAP 

0.124 
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Interestingly, the MAP model in this example was the one containing X 2 , X 3 and X 4 in its 
linear part, along with X\, X 2 and A", in its non-linear part, compatible with the mechanism 
generating the data. However, it was outperformed by the model selected according to the 
posterior inclusion probabilities. It is our recommendation to resist the temptation to blindly 
opt for the empirical MAP model, when it is unclear whether or not it is the true highest 
posterior model - in particular when the posterior distribution is multimodal or flat near its 
mode. In this example, not for the only time in this paper, model averaging proved to be 
a poor strategy. One possible reason could be poor estimation of the posterior distribution, 
resulting in excessive weight on low priority models. Figure 5 demonstrates the performance 
on the hold-out set of the model selected by the posterior inclusion probabilities. 

6. Applications 

6.1. Heat Exchanger Emulator 

We now wish to test our method on the computer simulation described in Qian et al. (2006). 
Here, the total rate of steady state heat transfer of a heat exchanger for electronic cooling 
applications is modelled as a function of four input factors: the mass flow rate of entry air, 
the temperature of entry air, the solid material thermal conductivity and the temperature of 
the heat source. The response was evaluated at a 64-run experimental design, constituting an 
orthogonal array-based LHD, and produced by a finite difference code. An additional 14 run 
test set was used to validate the resulting emulator. Input variables were standardized into the 
[0,1] region before analysis, to overcome the different scales. Using the prior distributions of 
Section 5, a sample of size 155, 000 was drawn from the posterior distribution, with the first 
5000 serving as bum-in. 

Previously, only interpolators have been attempted, as the simulation itself is deterministic. 
Qian et al. (2006) reported a RMSPE of 2.59 using a universal Kriging model with a linear 
trend, compared to 5.15 achieved by ordinary Kriging, while Ba and Joseph (2012) reduced 
the RMSPE to 2.24. Lately, Harari and Steinberg (2014a) managed an RMSPE of 1.93, using 
a convex combination of two independent Gaussian processes to model the spatial structure. 
Here we allowed for a non-zero nugget effect. The different methods are compared in Table 
2. One interesting feature is that x 4 has a strong linear effect but is slightly less dominant 
in the covariance. The other factors are modeled primarily via the covariance. This slightly 
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deviates from the work of Qian et al. (2006), who used linear terms for x 2 and x 4 , and mod¬ 
elled the correlation through all four factors. 

The dashed, horizontal line in Figure 7 marks the cutoff used in this example to omit terms 
from the model. The cutoff - 77% - was obtained via the following procedure: only terms with 
at least a 30% marginal inclusion probability were considered, and terms with 90% marginal 
inclusion probability or more were automatically included. That left 4 competing models, of 
which the eventual model was chosen via 8-fold cross-validation (fitting the model, based on 
56 observations and testing on the remaining 8 each time). Figure 11 shows the results, along 
with one-standard error marks, potentially serving to prioritize sparse models, in a similar 
manner to CART models (see Breiman et al. 1984, chapter 3). Generally, using this proce¬ 
dure leaves us with at most 2 p models to compare (and, in practice, typically far less), instead 
of 2 2p in an exhaustive search. As evident from Figure 8, in this example, a model consisting 
of 4 predictors (A"i in the linear part and X 2 , X 3 and X 4 in the stochastic part), was selected. 


p Posterior Distributions 


p Posterior Distributions 


o 

CM 




Figure 6: Box plots of the posterior distributions of the regression and correlation parameters for the heat 
exchanger data. 


6.2. The Boston Housing Data 

To test our method on a larger-scale problem, we now analyze the Boston Housing data set, 
which also appeared in Savitsky et al. (2011), and - most famously - in Breiman and Friedman 
(1985), where a full description of the 13 predictors - related to the median value of owner- 
occupied homes in census tracts of the Boston metropolitan area - can be found. The data 
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Posterior Inclusion Probabilities (Regression) 


Posterior Inclusion Probabilities (Correlation) 
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Figure 7: The posterior inclusion probabilities for the heat exchanger simulation. The dashed line marks a 77% 
posterior inclusion level. 
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Figure 8: 8-fold cross-validation results for the Heat Exchanger simulation data, including 1-SE marks. 
Table 2: Comparison of several prediction methodologies for the heat exchanger example. 


Method 

RMSPE 

OK 

2.023 

UK 

2.670 

Averaging 

2.686 

Postrior Inclusion 

1.793 

MAP 

1.793 


set contains 506 observations, of which we used 130 as a training set and the other 376 for 
validation. Again, we transformed the predictors so that they are all in the [0,1] interval. We 
produced a sample from the posterior distribution of size 250, 000 (with an additional burn-in 
sample of size 15, 000), and the same set of priors as in Section 5. The resulting marginal 
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posterior distributions are shown in Figure 9, where it is evident that none of the covariates 
makes a pure, linear contribution to the median property value, and that the 6th (average 
number of rooms per dwelling), 10th (full-value property-tax rate per $10, 000), 11th (pupil- 
teacher ratio by town) and 13th (% lower status of the population) covariates stand out in the 
non-linear part of the model. 


p Posterior Distributions 


p Posterior Distributions 




Figure 9: Box plots of the posterior distributions of the regression and correlation parameters, for the Boston 
Housing data. 
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Figure 10: The posterior inclusion probabilities for the Boston Housing data set. The dashed line marks a 76% 
posterior inclusion frequency. 

Figure 10, where the marginal posterior inclusion probabilities are shown, is very much in 
line with Figure 9. As previously in 6.1, the cutoff - 66% - was determined through a 10- 
fold cross-validation procedure, shown in Figure 11. In this example, the most parsimonious 
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model, consisting of only 4 predictors, was singled out as the obvious choice. 



0.44 0.52 0.66 0.95 


q 

Figure 11: 10-fold cross-validation results for the Boston Housing data, including 1-SE marks. 

In Figure 12, the reader can see how the emulator values y ( x ) vary for each of the covariates. 
We evaluated (10) at numerous sites on a regular grid, and the plotted values in each graph 
were obtained by averaging on the other three variables. It should come as no surprise to 
the reader, based on Figure 9, that X () and X ]:i are the variables accounting for the dominant 
nonlinear effects, while the output seems to be far less sensitive to changes in the values of 
Xio and X n , consequence of their p estimates of 0.392 and 0.668, respectively. Thorough 
discussion of sensitivity analysis can be found in Saltelli et al. (2008). While the reader 
may be misled by Figure 12 to think that X w and X u are linear effects in this example, we 
feel the need to emphasize that these plots were obtained by averaging, and do not capture 
interactions. That also explains the flat (or even slightly decreasing) average behavior of 
y ( x ) for the lower values of Xq (average number of rooms per dwelling). 

It is interesting to note that while the ordinary kriging and universal kriging models achieved 
empirical RMSPEs of 4.02 and 4.01, respectively, on the hold-out set, the selected model, 
based on 4 predictors, achieved a RMSPE of 3.98. Although prediction is not always the 
main purpose of variable selection procedures of this sort, it is still encouraging to find out 
that dimensionality reduction of this magnitude does not necessarily come at the expense 
of predictive accuracy. For further reference, we used the Treed Gaussian Process (TGP) 
model, a nonstationary model, developed and implemented by Gramacy and Lee (2008), 
which combines partitioning of the predictor space with local, stationary GP models. In this 
example, the TGP model achieved a RMSPE of 4.16. 
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Xio 

Figure 12: The predicted value y(x) for the Boston Housing data, plotted against X§, Am, An and A 13 , 
averaged over all other variables in each plot. 

7. Discussion 

In this work, we explored variable selection in Bayesian semiparametric regression models. 
Although the primary goal of these procedures is often of qualitative or explanatory nature, 
we keep one eye open on the accuracy of our predictions. With that in mind, our variable 
selection procedure is driven by prediction throughout, as evident from the cross-validation 
procedure described in Subsections 6.1 and 6.2, and unlike the frequentist-like selection pro¬ 
cedure of Linkletter et al. (2006), where identification of the active inputs was the sole pur¬ 
pose. 

Ideally, we would like to take advantage of the Bayesian framework to provide accurate 
predictions, using (9). However, using model averaging to a good effect might require an 
educated choice of prior distributions, along with possibly vast samples from the posterior 
disributions. One very interesting open question is whether a set of sparsity prior distribu¬ 
tions exists, such that under some aggregation methods, the minimax convergence rate of the 
mean squared prediction error can be achieved, as in the exponential weighting scheme of 
Rigollet and Tsybakov (2012) for linear regression models. 


Naturally, one may consider replacing model (1) with universal kriging models of the form 
y (x) = p 0 + f (;c) T (3 + Z (x) + £ (x), where f (x) = [/1 , f r 0)] T for some dic¬ 

tionary of regression functions. In that case, extra caution needs to be taken, as choice of 
regression functions already in the reproducing kernel Hilbert space associated with the co- 
variance kernel could cause a multicollinearity-like effect (see Bursztyn and Steinberg 

2004), resulting in the omission of crucial factors from our model. 
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One question left unanswered in this paper, and which may be the basis for future research, is 
whether or not one should include some linear terms for extrapolation purposes, even when 
the selection process indicates otherwise. Based on the artificial example of Section 5, it 
looks very likely that the model will drop most of the regression functions and rely on the 
covariance kernel. It would therefore be interesting to see how devastating an effect that 
would have on predictions outside the convex hull of the training data. Another possibility 
is to begin by “orthogonalizing” the covariance kernel to linear terms, by subtracting out the 
initial components of its Mercer expansion. Using Mercer’s expansion in computer experi¬ 
ments has been covered by Harari and Steinberg (2014b) and Fedorov and Flanagan (1998), 
among others. 

Supplementary Material 

R code and data files: R code for all sections and data files can be downloaded by the readers 
as a .zip file. 

References 

Andrianakis, I., Challenor, P., 2012. The Effect of the Nugget on Gaussian Process Emulators 
of Computer Models. Computational Statistics and Data Analysis 56 (12), 4215-4228. 

Ba, S., Joseph, V., 2012. Composite Gaussian Process Models for Emulating Expensive Func¬ 
tions. Annals of Applied Statistics 6 (4), 1838-1860. 

Barbieri, M., Berger, J., 2004. Optimal Predictive Model Selection. The Annals of Statistics 
32 (3), 870-897. 

Breiman, L., Friedman, J., 1985. Estimating Optimal Transformations for Multiple Regres¬ 
sion and Correlation. Journal of the American Statistical Association 80 (391), 580-598. 

Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression 
Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont, California, 
U.S.A. 

Bursztyn, D., Steinberg, D., 2004. Data Analytic Tools for Understanding Random Field 
Regression Models. Technometrics 46 (4), 411-420. 


17 



Chipman, H., George, E., McCulloch, R., 2001. The Practical Implementation of Bayesian 
Model Selection. Institute of Mathematical Statistics, Beachwood, OH, pp. 65-116. 

Cressie, N., 1993. Statistics for Spatial Data. Wiley-Interscience. 

Fedorov, V., Flanagan, D., 1998. Optimal Monitoring Network Design Based on Mercer’s 
Expansion of Covariance Kernel. Journal of Combinatorics, Information & System Sci¬ 
ences 23 (1-4), 237-250. 

Ghosh, J., Delampady, M., Samanta, T., 2006. An Introduction to Bayesian Analysis: Theory 
and Methods. Springer. 

Gramacy, R., Fee, H., 2008. Bayesian Treed Gaussian Process Models With an Application 
to Computer Modeling. Journal of the American Statistical Association 103 (483), 1119— 
1130. 

Harari, O., Steinberg, D., 2014a. Convex Combination of Gaussian Processes for Bayesian 
Analysis of Deterministic Computer Experiments. Technometrics (to appear). 

Harari, O., Steinberg, D., 2014b. Optimal Designs for Gaussian Process Models via Spectral 
Decomposition. Journal of Statistical Planning and Inference (to appear). 

Jones, B., Johnson, R., 2009. The Design and Analysis of the Gaussian Process Model. Qual¬ 
ity and Reliability Engineering International 25 (5), 515-524. 

Joseph, V., Hung, Y., Sudjianto, A., 2008. Blind Kriging: A New Method for Developing 
Metamodels. Journal of Mechanical Design 130 (3), 031102. 

Journel, A., Rossi, M., 1989. When Do We Need a Trend Model in Kriging? Mathematical 
Geology 21 (7), 715-739. 

Krige, D., 1951. A statistical Approach to some Mine Valuation and Allied Problems on the 
Witwatersrand. Master’s thesis, University of the Witwatersrand. 

Linkletter, C., Bingham, D., Hengartner, N., Higdon, D., Ye, K., 2006. Variable Selection for 
Gaussian Process Models in Computer Experiments. Technometrics 48 (4), 478-490. 

Qian, P, Seepersad, C., Joseph, V., Allen, J., Wu, C., 2006. Building Surrogate Models with 
Detailed and Approximate Simulations. ASME Journal of Mechanical Design 128 (4), 
668-677. 


18 



Rasmussen, C., Williams, C., 2006. Gaussian Processes for Machine Learning. The MIT 
Press. 

Rigollet, P, Tsybakov, A., 11 2012. Sparse Estimation by Exponential Weighting. Statistical 
Science 27 (4), 558-575. 

Roustant, O., Ginsbourger, D., Deville, Y., 2012. DiceKriging, DiceOptim: Two R Packages 
for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimiza¬ 
tion. Journal of Statistical Software 51 (1), 1-55. 

Sacks, J., Schiller, S., Welch, W., 1989. Designs for Computer Experiments. Technometrics 
31 (1), 41-47. 

Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., 
Trantola, S., 2008. Global Sensitivity Analysis: the Primer. John Wiley. 

Santner, T., Williams, B., Notz, W., 2003. The Design and Analysis of Computer Experi¬ 
ments. Springer. 

Savitsky, T., Vannucci, M., Sha, N., 2011. Variable Selection for Nonparametric Gaussian 
Process Priors: Models and Computational Strategies. Statistical Science 26 (1), 130-149. 

Wahba, G., 1990. Spline Models for Observational Data. Society for Industrial and Applied 
Mathematics. 


19 



